Sample Databases
From PostgreSQL wiki
Many database systems provide sample databases with the product. A good intro to popular ones that includes discussion of samples available for other databases is Sample Databases for PostgreSQL and More (2006).
One trivial sample that ships with PostgreSQL is pgbench. It has the advantage of being built in and of including a scalable data generator.
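As a rough sketch of what the pgbench sample looks like once loaded: assuming a database has been initialized with `pgbench -i -s 10` (the scale factor of 10 is only an illustrative choice), the standard pgbench tables can be inspected with plain SQL. At scale factor s, pgbench_accounts should hold 100,000 × s rows.

```sql
-- Row counts for the four standard pgbench tables.
SELECT 'pgbench_branches' AS table_name, count(*) AS row_count FROM pgbench_branches
UNION ALL
SELECT 'pgbench_tellers', count(*) FROM pgbench_tellers
UNION ALL
SELECT 'pgbench_accounts', count(*) FROM pgbench_accounts
UNION ALL
SELECT 'pgbench_history', count(*) FROM pgbench_history;

-- Number of accounts and total balance per branch.
SELECT b.bid,
       count(a.aid)    AS accounts,
       sum(a.abalance) AS total_balance
FROM pgbench_branches AS b
JOIN pgbench_accounts AS a USING (bid)
GROUP BY b.bid
ORDER BY b.bid;
```

Immediately after initialization the balances are all zero and pgbench_history is empty; both change once a benchmark run has executed.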
- MySQL has a popular sample database named Sakila. Sakila has been ported to many databases including Postgres.
- Pagila is a more idiomatic Postgres port of Sakila.
- PgFoundry had a collection of Postgres-compatible sample databases but it has not been updated since 2008.
- IMDB Data for JOB Workload, as used in the paper "How Good are Query Optimizers, Really?". This data was generated using a tool that is freely available on GitHub. It is conveniently available to download as a pg_dump -Fc dump.
- IMDB - the original IMDB source data.
- The land registry file from data.gov.uk has details of land sales in the UK, going back several decades, and is 4.3GB as of May 2024 (this applies only to the "complete" file, "pp-complete.csv"). No registration required.
    -- Download file "pp-complete.csv", which has all records.
    -- If schema changes/field added, consult: https://www.gov.uk/guidance/about-the-price-paid-data

    -- Create table:
    CREATE TABLE land_registry_price_paid_uk(
      transaction uuid,
      price numeric,
      transfer_date date,
      postcode text,
      property_type char(1),
      newly_built boolean,
      duration char(1),
      paon text,
      saon text,
      street text,
      locality text,
      city text,
      district text,
      county text,
      ppd_category_type char(1),
      record_status char(1));

    -- Copy CSV data, with appropriate munging:
    COPY land_registry_price_paid_uk FROM '/path/to/pp-complete.csv'
      with (format csv, encoding 'win1252', header false, null '', quote '"',
            force_null (postcode, saon, paon, street, locality, city, district));
- AdventureWorks 2014 for Postgres - Scripts to set up the OLTP part of the go-to database used in training classes and for sample apps on the Microsoft stack. The result is 68 tables containing HR, sales, product, and purchasing data organized across 5 schemas. It represents a fictitious bicycle parts wholesaler with a hierarchy of nearly 300 employees, 500 products, 20000 customers, and 31000 sales each having an average of 4 line items. So it's big enough to be interesting, but not unwieldy. In addition to being a well-rounded OLTP sample, it is also a good choice to demonstrate ETL into a data warehouse. The code in some of the views demonstrates effective techniques for querying XML data.
- Mouse Genome sample data set. See instructions. Custom format dump, 1.9GB compressed, but restored database is tens of GB in size. MGI is the international database resource for the laboratory mouse, providing integrated genetic, genomic, and biological data to facilitate the study of human health and disease. MGI use PostgreSQL in production [1], providing direct protocol access to researchers, so the custom format dump is not an afterthought. Apparently updated frequently.
- Benchmarking databases such as DBT-2 or TPC-H can be used as samples.
- Freebase - Various wiki style data on places/people/things - up to 22GB compressed
- OMDB - Open Media database, ~30MB compressed, 300MB when loaded - https://github.com/df7cb/omdb-postgresql
- Data.gov - US federal government data collection, see also Sunlight Labs
- DBpedia - Wikipedia data export project
- eoddata - historic stock market data (requires registration; licence unclear)
- RITA - Airline On-Time Performance Data
- Openstreetmap - Openstreetmap source data
- NCBI - biological annotation from NCBI's ENTREZ system (updated daily)
- Airlines Demo Database - provides a database schema with several tables and meaningful content, which can be used for learning SQL and writing applications
- Stack Exchange Data Dump - Anonymized dump of all user-contributed content on the Stack Exchange network (Stack Overflow, Server Fault...) under the CC BY-SA 3.0 license. Use this tool to import the XML dumps into PostgreSQL: https://github.com/Networks-Learning/stackexchange-dump-to-postgres
- The Museum of Modern Art (MoMA) collection data - This research dataset contains more than 130,000 records, representing all of the works that have been accessioned into MoMA’s collection and cataloged in its database. It includes basic metadata for each work, including title, artist, date made, medium, dimensions, and date acquired by the Museum. The data is available in CSV and JSON formats, encoded in UTF-8.
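Once a sample is loaded, it can be explored with ordinary queries. As a minimal sketch, assuming the land_registry_price_paid_uk table from the land registry entry above has been created and populated exactly as shown there, the following summarizes sales per year. The output column names (sale_year, sales, avg_price, max_price) are illustrative and not part of the dataset.

```sql
-- Yearly sales volume and prices from the UK land registry sample.
-- Assumes the table and column names from the CREATE TABLE statement above.
SELECT extract(year FROM transfer_date)::int AS sale_year,
       count(*)                              AS sales,
       round(avg(price))                     AS avg_price,
       max(price)                            AS max_price
FROM land_registry_price_paid_uk
GROUP BY sale_year
ORDER BY sale_year;
```

Because the table is loaded without any indexes, this query scans every row. That is acceptable for a one-off summary, but repeated filtering by date or postcode would probably benefit from an index on the relevant column.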